RED WINE QUALITY DATA EXPLORATION by SIDDHARTH SHANKAR

Abstract

I analyze the Red Wine dataset. The key goals of the study are to understand which chemical properties influence the quality of red wines and how those properties correlate with one another.

Introduction

About the data: The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Number of Attributes: 11 + output attribute
Attribute information:
Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)
    Output variable (based on sensory data):
  12. quality (score between 0 and 10)

Description of attributes:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  4. residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
  5. chlorides: the amount of salt in the wine
  6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
  7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  8. density: the density of wine is close to that of water depending on the percent alcohol and sugar content
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  11. alcohol: the percent alcohol content of the wine
    Output variable (based on sensory data):
  12. quality (score between 0 and 10)

Univariate Plots Section

Brief Data & Summary of the dataset

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Wine Quality Distribution

Wine quality is a discrete, ordinal variable that ranges from 3 to 8 in this dataset, so there are no exceptionally good or exceptionally bad wines. Treating the scores as continuous gives a mean of 5.636 and a median of 6.
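As a minimal illustration of treating quality as a discrete score, the Python sketch below tallies the ratings and computes their mean and median. It uses only the six quality values from the data preview, not the full 1,599 rows, so its numbers differ from the full-dataset figures; the variable names are illustrative.

```python
from collections import Counter
from statistics import mean, median

# Quality scores of the first six wines shown in the data preview
# (a small sample, not the full dataset).
quality_sample = [5, 5, 5, 6, 5, 5]

# Count how many wines received each discrete score.
quality_counts = Counter(quality_sample)

# Treat the scores as numeric to summarise their centre.
quality_mean = mean(quality_sample)
quality_median = median(quality_sample)

for score in sorted(quality_counts):
    print(f"quality {score}: {quality_counts[score]} wine(s)")
print(f"mean = {quality_mean:.3f}, median = {quality_median}")
```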

Chemical Properties Distribution

Fixed Acidity appears to be largely positively skewed with a mean of 8.32 and median 7.90.

The mean and median of pH are approximately equal, at 3.311 and 3.310 respectively, which suggests that the pH distribution is roughly symmetric and approximately normal. A little online research also showed that red wines typically have a pH between 3.3 and 3.6.
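One rough way to check these impressions from the summary table alone is Bowley's quartile skewness, (Q3 + Q1 - 2 * median) / (Q3 - Q1). The Python sketch below applies it to the quartiles reported for fixed acidity and pH; the function name `quartile_skewness` is an illustrative choice, and this measure was not part of the original analysis.

```python
def quartile_skewness(q1, median, q3):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*median) / (Q3 - Q1)."""
    return (q3 + q1 - 2 * median) / (q3 - q1)

# Quartiles copied from the summary table of the dataset.
fixed_acidity_skew = quartile_skewness(q1=7.10, median=7.90, q3=9.20)
ph_skew = quartile_skewness(q1=3.21, median=3.31, q3=3.40)

print(f"fixed acidity: {fixed_acidity_skew:.3f}")
print(f"pH:            {ph_skew:.3f}")
```

A value near 0 suggests symmetry around the median, while a positive value indicates a longer right side. Because the measure ignores the tails, it complements the histograms rather than replacing them.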

Univariate Analysis

We plotted histograms of the 11 chemical properties of red wine to get an idea of the dispersion of each property. Based on those histograms, the following observations can be made on the distribution of chemical properties:

  1. Normally Distributed: Volatile Acidity, Density, pH
  2. Positively Skewed: Fixed Acidity, Citric Acid, Free Sulfur Dioxide, Total Sulfur Dioxide, Sulphates, Alcohol
  3. Long Tail: Residual Sugar, Chlorides

Large outliers can be seen in the positively skewed and long-tailed variables. We transform some of them with log10 to produce a relatively normal distribution.
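To show why a log10 transform helps with a long right tail, the Python sketch below applies it to the minimum, median and maximum of residual sugar from the summary table. The helper `tail_ratio` is a hypothetical illustration, not a standard statistic: it compares the median-to-maximum distance with the minimum-to-median distance.

```python
import math

# Minimum, median and maximum of residual sugar from the summary table.
residual_sugar = {"min": 0.9, "median": 2.2, "max": 15.5}

# Apply the log10 transform to each summary value.
log_residual_sugar = {
    name: math.log10(value) for name, value in residual_sugar.items()
}

def tail_ratio(summary):
    """Upper spread (max - median) relative to lower spread (median - min)."""
    return (summary["max"] - summary["median"]) / (
        summary["median"] - summary["min"]
    )

print(f"raw tail ratio:   {tail_ratio(residual_sugar):.2f}")
print(f"log10 tail ratio: {tail_ratio(log_residual_sugar):.2f}")
```

The ratio is much smaller after the transform, meaning the upper tail is pulled in towards the bulk of the data.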

What is the structure of your dataset?

There are 1,599 observations with 11 attributes (11 variables on the chemical properties of the wine) + 1 output attribute (quality of red wine).

What is/are the main feature(s) of interest in your dataset?

The quality rating is the main feature in the dataset which defines the good and bad taste of the red wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Based on the above distributions, I think that fixed acidity, citric acid, residual sugar, pH, chlorides will be the features of interest.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Some of the distributions were positively skewed or long-tailed; I applied a log10 transformation to these to produce a relatively normal distribution.

Bivariate Plots Section

Let’s begin by examining the correlation between pairs of variables using a correlation plot.

Correlation coefficients for pairs of variables

##                     Var1                 Var2   Freq
## 52                    pH          citric.acid -0.542
## 130          citric.acid                   pH -0.542
## 32           citric.acid     volatile.acidity -0.552
## 45      volatile.acidity          citric.acid -0.552
## 23               density        fixed.acidity  0.668
## 92  total.sulfur.dioxide  free.sulfur.dioxide  0.668
## 105  free.sulfur.dioxide total.sulfur.dioxide  0.668
## 114        fixed.acidity              density  0.668
## 18           citric.acid        fixed.acidity  0.672
## 44         fixed.acidity          citric.acid  0.672
## 24                    pH        fixed.acidity -0.683
## 128        fixed.acidity                   pH -0.683
## 1                      X                    X  1.000
## 16         fixed.acidity        fixed.acidity  1.000
## 31      volatile.acidity     volatile.acidity  1.000
## 46           citric.acid          citric.acid  1.000
## 61        residual.sugar       residual.sugar  1.000
## 76             chlorides            chlorides  1.000
## 91   free.sulfur.dioxide  free.sulfur.dioxide  1.000
## 106 total.sulfur.dioxide total.sulfur.dioxide  1.000
## 121              density              density  1.000
## 136                   pH                   pH  1.000
## 151            sulphates            sulphates  1.000
## 166              alcohol              alcohol  1.000
## 181              quality              quality  1.000
## 182           numquality              quality  1.000
## 195              quality           numquality  1.000
## 196           numquality           numquality  1.000

Four of the most strongly correlated pairs of chemical properties are:

  1. fixed acidity & pH, with a correlation coefficient of -0.683: pH tends to decrease as fixed acidity increases

## [1] -0.6829782

  2. citric acid & volatile acidity, with a correlation coefficient of -0.552: citric acid tends to decrease as volatile acidity increases

## [1] -0.5524957

  3. citric acid & pH, with a correlation coefficient of -0.542 (slightly weaker): pH tends to decrease as citric acid increases

## [1] -0.5419041
  4. citric acid & fixed acidity, with a correlation coefficient of 0.672: fixed acidity tends to increase as citric acid increases

## [1] 0.6717034

This is consistent with citric acid being one of the fixed (nonvolatile) acids in wine, so it contributes to the fixed acidity.

Let us now observe the boxplots of the selected variables by wine quality; comparing the medians across quality levels gives a robust view of how each variable changes with quality.

Higher-quality wines tend to have a higher alcohol content than low-quality wines.

Volatile acidity decreases as the wine grade increases. Volatile acidity contributes to the smell of wine, and too much of it reduces wine quality.

Citric acid appears to be strongly associated with wine quality. In low-grade red wines its median is close to 0, while a well-balanced level of citric acid goes with higher quality.

Although sulphates are used to maintain the freshness of wines, a higher level of sulphates also goes with a higher wine grade.
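The comparison of medians across quality levels can be sketched in Python as follows. This is a minimal example that uses only the quality and alcohol values of the first six wines in the data preview, so it covers just two quality levels; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import median

# (quality, alcohol) pairs for the first six wines in the data preview.
wines = [(5, 9.4), (5, 9.8), (5, 9.8), (6, 9.8), (5, 9.4), (5, 9.4)]

# Group the alcohol values by quality score.
alcohol_by_quality = defaultdict(list)
for quality, alcohol in wines:
    alcohol_by_quality[quality].append(alcohol)

# The median of each group is the centre line a boxplot would draw.
median_alcohol = {
    quality: median(values)
    for quality, values in sorted(alcohol_by_quality.items())
}
print(median_alcohol)
```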

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Volatile acidity contributes to the aroma of wine and is not intentionally added. The boxplot of volatile acidity by wine grade shows that the higher the volatile acidity, the lower the quality of the wine, and vice versa. Higher-quality wines also tend to have a higher level of alcohol. The median sulphate level increases with each wine grade (quality).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

When citric acid increases, fixed acidity also increases denoting a positive correlation. Citric acid and volatile acidity are negatively correlated. Citric acid and pH were also negatively correlated – a lower pH indicates a higher acidity.

What was the strongest relationship you found?

pH & Fixed Acidity with the correlation coefficient of -0.683.

Multivariate Plots Section

It can be observed that higher quality wine has lower volatile acidity.

It can be observed that higher-quality wines have higher alcohol, lower volatile acidity and higher sulphates.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The multivariate analysis only strengthens the relationships we observed in the bivariate analysis. It shows that higher-quality wines have higher alcohol, lower volatile acidity and higher sulphates.

Were there any interesting or surprising interactions between features?

No


Final Plots and Summary

Plot One

Description One

Plot one shows the distribution of wine quality ratings. It can be observed that the given dataset of red wine contains a large number of wines that are average in quality. The mean and median of the quality of red wines are 5.636 and 6 respectively.

Plot Two

Description Two

Based on the correlation, the following four chemical properties have the highest correlation with quality: alcohol, volatile acidity, citric acid and sulphates. The higher the wine grade, the higher the level of alcohol and citric acid. If we group wine grades as bad (3, 4), average (5, 6) and good (7, 8), we can observe that average wines have a higher content of sulphates and alcohol. Also, the level of sulphates increases slightly in good-grade wines; sulphates play an important role in maintaining the freshness of the wine.
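The grouping of quality scores into bad, average and good can be written as a small helper. The Python sketch below is one possible way to do it; the function name `wine_grade` is hypothetical.

```python
def wine_grade(quality):
    """Map a quality score to the grouping bad (3, 4), average (5, 6), good (7, 8)."""
    if quality <= 4:
        return "bad"
    if quality <= 6:
        return "average"
    return "good"

# Scores observed in the dataset range from 3 to 8.
grades = {score: wine_grade(score) for score in range(3, 9)}
print(grades)
```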

Plot Three

Description Three

It can be observed that higher-quality wines have more alcohol and less volatile acidity; in other words, quality tends to increase as alcohol increases and volatile acidity decreases.

Reflection

The key goals of this study were to understand which chemical properties influence the quality of red wines and how those properties correlate with one another. The red wine data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. Initially, when I plotted the histograms of all 11 variables, I assumed from the shape of the plots that some of these variables were related to each other, for example by being directly or inversely proportional or by one being a component of another, and the correlation analysis supported this. The correlation showed that pH tends to decrease as fixed acidity increases, and that citric acid and fixed acidity go hand in hand, i.e. they are positively correlated with a value of 0.6717. After doing some web research, I learned a few things about the presence of different chemical properties in wine.

Volatile acidity is negatively correlated with quality. It refers to the acidic elements of the wine that are gaseous rather than liquid. Acetic acid is the compound mainly responsible for the aroma. Although it is not intentionally added to the wine, it is an important characteristic of many wines that adds complexity and interest, often in a positive manner.

The presence of alcohol plays an important role in determining the quality of wines. Wines with a higher level of alcohol provide rich, ripe fruit flavors. Those flavors come from really ripe grapes, and really ripe grapes come from warmer growing conditions. Those grapes contain more sugar, and more sugar produces more alcohol during fermentation.

The presence of sulphates in wine helps determine its freshness and, based on the correlation, the level of sulphates increases with wine quality.

Further improvements could be made if data for exceptionally good and bad wines were available. However, assessing the quality of wine is complex; if factors other than chemical properties, such as storage duration and the quality and type of grapes, were provided, the quality of the analysis could be improved.

Problems & Solutions

When I plotted the correlation matrix, all the data overlapped and it looked messy. A Google search showed how to present the correlations as an ordered list, so I created a correlation matrix, transformed it into a data frame and kept only the rows above a certain value, in order, to show only the relevant values.
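The idea of turning a correlation matrix into an ordered, filtered list can be sketched in Python as follows. This is not the code used for the report: it uses four columns from the first six rows of the data preview only, the threshold of 0.5 is an assumed value, and the names `pearson` and `ordered_correlations` are hypothetical. Unlike the table in the bivariate section, it lists each pair once and skips self-correlations.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Four columns from the first six rows of the data preview (sample only).
columns = {
    "fixed.acidity": [7.4, 7.8, 7.8, 11.2, 7.4, 7.4],
    "volatile.acidity": [0.70, 0.88, 0.76, 0.28, 0.70, 0.66],
    "citric.acid": [0.00, 0.00, 0.04, 0.56, 0.00, 0.00],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.51],
}

def ordered_correlations(columns, threshold=0.5):
    """List each variable pair once, keep |r| >= threshold, strongest first."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(columns, 2)
    ]
    strong = [pair for pair in pairs if abs(pair[2]) >= threshold]
    return sorted(strong, key=lambda pair: abs(pair[2]), reverse=True)

for var1, var2, r in ordered_correlations(columns):
    print(f"{var1:>16}  {var2:>16}  {r:6.3f}")
```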

References

[1] https://stackoverflow.com/questions/7074246/show-correlations-as-an-ordered-list-not-as-a-large-matrix/7074856

[2] https://www.decanter.com/learn/volatile-acidity-va-45532/

[3] https://www.thekitchn.com/the-truth-about-sulfites-in-wine-myths-of-red-wine-headaches-100878

[4] https://www.tennessean.com/story/life/food/2015/04/17/alcohol-content-affect-wine/25779589/